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Abstract 

Background: With the advancement of next-generation sequencing and transcriptomics technologies, regulatory 
effects involving RNA, in particular RNA structural changes are being detected. These results often rely on RNA 
secondary structure predictions. However, current approaches to RNA secondary structure modelling produce 
predictions with a high variance in predictive accuracy, and we have little quantifiable knowledge about the 
reasons for these variances. 

Results: In this paper we explore a number of factors which can contribute to poor RNA secondary structure 
prediction quality. We establish a quantified relationship between alignment quality and loss of accuracy. Furthermore, 
we define two new measures to quantif/ uncertainty in alignment-based structure predictions. One of the measures 
improves on the "reliability score" reported by PPfold, and considers alignment uncertainty as well as base-pair 
probabilities. The other measure considers the information entropy for SCFGs over a space of input alignments. 

Conclusions: Our predictive accuracy improves on the PPfold reliability score. We can successfully characterize many of 
the underlying reasons for and variances in poor prediction. However, there is still variability unaccounted for, which 
we therefore suggest comes from the RNA secondary structure predictive model itself. 



Background 

RNA secondary structure prediction is still an important 
problem in computational biology. With the advent of 
next generation sequencing and RNA-seq technologies, 
many RNA structural changes are being found to play 
important roles in regulating gene expression [1,2]. Gene 
regulation studies can now be done on a genome-wide 
scale. In some cases RNA secondary structures can be 
experimentally determined on a genome-wide level [3], 
but these methods require RNA isolation and many not 
preserve in vivo structures. RNA secondary structure 
prediction programs are still often used to predict struc- 
tures across the genome [4]. The predicted secondary 
structures, and predicted structural changes, are being 
used to find relationships and suggest mechanisms in 
gene regulatory networks. 

Some methods for RNA secondary structure predic- 
tion only consider a single sequence as input. However, 
prediction quality can be improved by using multiple 
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sequences, assuming that RNA secondary structure is 
conserved through evolution. Even without a complex 
evolutionary model, these additional structural constraints 
provide valuable information on folding. Comparative 
methods for RNA secondary structure prediction are 
based on this observation, and use evolutionary informa- 
tion from multiple alignments to improve prediction 
quality. 

Methods for RNA secondary structure prediction gen- 
erally fall into two categories. Thermodynamic models 
make use of free-energy functions, which take experi- 
mentally determined energy parameters for individual 
structural elements. Dynamic programming is then used 
to find the secondary structure with the minimum free 
energy, which is reported as the predicted structure. 
This has been successfully implemented in programs 
such as RNAfold [5] and UNAfold [6]. Thermodynamic 
methods typically deal with the single-sequence predic- 
tion problem, but extensions such as RNAalifold [7] and 
PETfold [8] allow for comparative prediction. 

Stochastic context-free grammars (SCFGs), on the 
other hand, define a probability distribution over the 
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space of RNA secondary structures. Posterior decoding 
techniques are typically used to determine, for example, 
the maximum expected accuracy structure [9]. SCFGs 
have been used for the single-sequence prediction prob- 
lem [10], but their advantage comes through coupling 
with a molecular evolution model. The first comparative 
SCFG-based approach was developed in Pfold [11,12], 
where alignment column probabilities were determined 
through single and paired column evolution models, cal- 
culated via the Felsenstein pruning algorithm [13]. For a 
more complete review on RNA secondary structure pre- 
diction, see [14]. 

In genome-wide predictions of RNA secondary struc- 
ture, the accuracy of the secondary structure pre-diction 
program is rarely factored into analysis. Typically only 
the mean accuracy is reported in RNA secondary struc- 
ture prediction benchmarks [14], with the variance in 
the accuracy given little thought. Variance in accuracy is 
particularly problematic in the case of single-sequence 
prediction. Figure 1 shows the cumulative density func- 
tion of predictive accuracy for two single-sequence ap- 
plications of RNA secondary structure prediction, 
RNAfold and PPfold (a recent implementation of Pfold, 
[15]), on 443 sequences taken from the RNASTRAND 
database [16]. Additionally, a uniform (0,1) cumulative 
density function is shown for comparison. The figure il- 
lustrates that for sequences in this data set, the predict- 
ive accuracy of RNAfold and PPfold is not very much 
different from a random number between 0 and 1. 
When genome-wide RNA secondary structure predic- 
tion is done on a large number of single sequences, 
many predictions will be poor ones. 



Consequently, it is important to understand more 
about the variability of RNA secondary structure predic- 
tion programs. In comparative prediction, there are 
many sources of variability: alignment quality, the num- 
ber of sequences chosen and which alignment samples 
have been taken, the evolutionary relationships between 
sequences, as well as the ill-conditioned nature of the 
folding model itself. Understanding and quantifying 
these variances is key for biological applications that rely 
on these folding programs. Additionally, other bioinfor- 
matics software that utilize these folding programs- for 
example inverse RNA folding algorithms [17]- may 
often experience a fundamental limitation in perform- 
ance due to variance in structure prediction quality. 

Sequence alignment is a fundamental step in most 
comparative sequence analysis pipelines. The typical ap- 
proach is to create a single, trusted multiple alignment 
of the sequences using methods based on an artificial 
scoring scheme and heuristics to find a highly scoring 
alignment [18,19]. Although this methodology is suc- 
cessful when the alignment is well resolved, it has been 
shown in the context of downstream analyses that the 
end result can be highly sensitive to the choice of align- 
ment [20-22]. RNA secondary structure prediction 
methods take a variety of approaches with respect to 
possible errors in RNA alignments. Some methods (e.g. 
[23]) invoke a fold-and-align approach directly, where 
alignment is done simultaneously with structure predic- 
tion. Pfold, instead, takes a fixed alignment as input, but 
allocates a nite probability to a nucleotide being any 
other nucleotide; this makes the model more robust to 
poor alignment quality. Most modern methods (e.g. 
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Figure 1 Cumulative density of secondary structure prediction accuracy on 433 RNASTRAND structures. Performance accuracy on 433 
sequences from RNASTRAND with known secondary structure by RNAfold and PPfold. 
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[24]) still consider prediction from a single, fixed, align- 
ment. Recently, alignment-free methods have also been 
proposed [25]. However, even after considering poor 
alignment quality, there are many additional variances 
associated with poor comparative RNA secondary struc- 
ture prediction. 

Sequence selection is another important variable. 
Alignment methods may produce poor alignments due 
to poor individual sequences, which will in turn produce 
poor structure predictions. There have been methods 
developed to select homologous sequences, particularly 
[26], which is based on evolutionary models and struc- 
tural constraints. This is implemented in [27], which 
shows strong results in RNA secondary structure predic- 
tion can be found by selecting useful sequences. 

In this paper we consider the problem of variances in 
comparative RNA secondary structure prediction. We 
present statistical analyses of different variances includ- 
ing the relationship between structure prediction quality 
and chosen alignment distance from the reference align- 
ment, and a predictive algorithm for accuracy is pro- 
vided. Factors like the number of sequences in the 
alignment and the evolutionary distance of the se- 
quences are considered. Finally, a novel method is 
presented which extends information entropy for sto- 
chastic context-free grammars [28] to consider variation 
over alignments. 

Statistical alignment 

Statistical alignment [29] provides a solution to many of 
the issues encountered with the traditional approach of 
sequence alignment. It models sequence evolution as a 
stochastic process involving sequence insertions, dele- 
tions and character substitutions, which defines a prob- 
ability distribution over the alignments of the sequences. 
Using techniques such as Expectation Maximisation or 
Markov Chain Monte Carlo (MCMC), it is possible to 
either maximize the likelihood of the alignment, or gen- 
erate a representative set of putative alignments by sam- 
pling from the alignment distribution. 

Several statistical alignment implementations have 
emerged in past years [30-33], some of which allow 
co-sampling other entities such as the evolutionary 
tree or the locations of cis-regulatory motifs. Such 
methods can highlight homoplasy and alignment 
uncertainty with high accuracy or can be used to 
decrease alignment uncertainty effects in downstream 
analyses, for instance protein secondary structure 
prediction [34]. 

StatAlign is a statistical alignment software package 
[32] that allows joint Bayesian analysis of multiple 
alignments, phylogenetic trees and evolutionary pa- 
rameters. The background model for insertions and 
deletions is a modified version of the TKF92 model 



[35] as described in [34]. The indel model can be 
coupled with an arbitrary substitution model (many 
of which are distributed with the software, both for 
protein and nucleotide sequence data). The Bayesian 
analysis is based on MCMC, the transition kernels 
are improved versions of those in [34]. StatAlign gen- 
erates random samples from the joint posterior distri- 
bution of sequence alignments, evolutionary trees and 
model parameters. This high-dimensional joint distri- 
bution can be analysed in several ways, ranging from 
the simple statistics of marginalised single dimensions 
(e.g. the posterior distribution of a single rate param- 
eter) to the application of other tools to the 
alignment samples. 

Alignment and RNA secondary structure accuracy metrics 

To analyse variances of RNA secondary structure as 
alignment quality varies, we calculate a similarity score 
that measures how close a sample alignment is to the 
reference alignment. We use an alignment metric, taken 
from [36], which is generalised to an alignment method 
in [37]. 

Let as-^,s2 be an alignment of a sequence Si of length n 
to a sequence S2 of length m. Each column of can 

be expressed as pairs of the form ^4^4)' 

(-,50. We define 



^('^51,52) = |; (-,4) ^^51,52 1 , 



and 



sets which represent 'homology', 'insertion' and 'deletion' 
respectively. Given these sets, we define the distance be- 
tween two alignments a^^^ and a^^^ of two sequences 
Si and S2 to be 



nH a] 



■I 



For example, consider the case a'. 



k 

'Sl ,52 



(1) 

= , . Then we 



have n = m, \H(a^]nHU^ 



as there are no deletions, and III a'', , ]<^lia] 



0 

-- 0 as there 

are no insertions. This gives the distance between the align- 
ments as zero, as would be expected. 

Equation 1 can be generalised to sequence alignments 
with more than two sequences. Assuming now that 
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and are alignments of m sequences 5i, 5^ of lengths 
/^i, n^, we have 



m-l m 



(2) 



p=l q=p+l 



that is, summing all the pairwise alignment distances 
from Equation 1, This alignment metric if is then nor- 
malized and subtracted from 1 to produce a similarity 
score 



^ m 



(3) 



t=l 



The denominator of the fraction, (m-l)^^?^, is the 

t=i 

normalizing constant, the maximum that the alignment 
distance d{a^^a^) can be. The similarity score is 
bounded by 0 and 1, with 1 indicating that the sample 
alignment is identical to the reference alignment. 

RNA secondary structure metrics 

There are a wealth of available metrics on RNA second- 
ary structure [14,38]. Here we use sensitivity, positive 
predictive value (PPV), and F-score (the harmonic mean 
of sensitivity and PPV). Defining true positives (TP) as 
the number of base-pairs correctly predicted, false posi- 
tives (FP) as the number of true base-pairs not pre- 
dicted, and false negatives (FN) as the number of base- 
pairs predicted which are incorrect, we have 



Sensitivity 
PPV 
F-score 



TP 



TP + FN 
TP 

TP + FP 

2xTP 

2XTP + FN + FP 



The strength of these RNA secondary structure accur- 
acy metrics is that they are easy to interpret, and make it 
straightforward to compare methods across different 
datasets. An F-score of 1 would represent a structure 
prediction that was completely correct and an F-score of 
0 a structure prediction that only predicted incorrect 
base-pairs. 

Information entropy 

As we later develop calculations for information entropy 
for a set of alignments, here we outline the computation 
of information entropy for a single alignment. The infor- 
mation entropy // of a probability distribution P with a 



set of events A' is defined as: 



(4) 



Information entropy is a measure for the "spread" of 
the probability distribution, and has well-defined lower 
and upper bounds. The minimum entropy of 0 occurs 
when there is only one outcome with probability 1, and 
the maximum entropy of log2 (vi) occurs when there are 
n possible outcomes, each with probability Wn, that is 
the uniform distribution. XVlien the base of the loga- 
rithm is 2, the entropy is measured in bits. For a prob- 
ability distribution, an entropy of k bits indicates that 
the expected value of the information content of observ- 
ing a single outcome is k bits. In the context of second- 
ary structure prediction, a low entropy therefore 
indicates that few secondary structures dominate the 
probability space, whereas a high entropy indicates a 
more even probability distribution over possible second- 
ary structures. Thus, information entropy is a useful sin- 
gle quantity to characterize the underlying probability 
distribution of secondary structures. 

The information entropy of the probability distribution 
over the possible secondary structures generated by a 
phylo-SCFG can be obtained using expected rule fre- 
quencies, which can be computed using the inside- 
outside algorithm [28]. This is outlined here. 

Let the set of all derivations for the input alignment be 
O. Since the probability of a derivation d can be written as 
the product of the SCFG production rule probabilities and 
the phylogenetic probabilities, we can write the total prob- 
ability T of the grammar producing the input string as 



(5) 



d&0 



where Pg[<^ denotes the prior probabilities obtained from 
the SCFG part of the model, and P^ [d] are the likelihood 
factors obtained from the phylogenetic model. Condition- 
ing on producing the input string, the normalized prob- 
ability of a derivation d is Pold] = ^Pld] = j^PG[d]PT[d]^ 
Consequently, we have that the information entropy of 
the input alignment under the phylo-SCFG model is 



(6) 



d&0 



which can be written using Equation 5 as 

(7) 

that is, separating out the SCFG contribution and the 
phylogenetic contribution. To calculate the entropy in 
practice, firstly we use a simplified form of the SCFG 
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contribution. For a SCFG with set of production rules R, 
we can write the SCFG contribution in terms of the 
expected production rule frequency, 

Y^PmogMd]) = E ^ogMrmusesofr]. 

(8) 

Secondly, we can simplify the phylogenetic contribu- 
tion. Let G 7? be a SCFG rule which produces base 
pairs, and e 7? a SCFG rule which produces unpaired 
bases. Define 1^ (/, ;), the indicator function for whether 
the column pair (/, J) is emitted from a rule (i.e. pos- 
ition / and 7 form a pair), and 1^ (/), the indicator func- 
tion for whether column / is emitted from a rule (i.e. 
position / is unpaired). Finally, define (ra) as the fre- 
quency that rule is used in derivation d. Then 

^P[d] log^iPrld]) = ^P[d](^Yl ^og^iPrlclcunpairedf'^'"^) 
log,(Prk,c'|c,cf^('-'')) 

rb 

= ^ log2 {Pt W unpaired] ) 1^ {i)P[d] 

i d&0 

+^ \og^{PT%j%jpaired\)Y, 

iJ d&0 

As ^ is the total probability under the 

model that positions / and ; are emitted from a rule r^, 
and is just the total probability under the 

d&O 

model that position / is emitted from a rule r^, the quan- 
tity P [d] log2 {Pt [d] ) can be computed using the 

d&0 

expected rule frequencies obtained through the inside- 
outside algorithm [39]. 

Methods 

Data 

StatAlign Dataset 

To test factors relating to alignment quality and secondary 
structure prediction quality, a large number of alignment 
samples from trusted reference alignments with known sec- 
ondary structures are needed. We have created a curated 
RNA dataset based on the Rfam database [40] for the pur- 
poses of evaluating the framework. Alignments of homolo- 
gous RNA sequences with known consensus secondary 
structure were extracted from Rfam seed alignments. From 
these, 50 RNA families with at least 50 sequences were ran- 
domly selected (see Additional file 1) in the section 
"StatAlign Dataset". From each family, in a pre-filtering step 
we removed divergent sequences with long indels, as fol- 
lows. We defined insertion as consecutive non-gap charac- 
ters of a sequence in the reference alignment which appear 
in columns where over 80% of the sequences have gaps. 



Deletions were defined analogously. Columns with fraction 
of gaps between 20% and 80% were regarded ambiguous 
and ignored. To over-penalize long indels, we applied the 
super linear score fianction / x log2 (/ + 1) for indels of 
length /, indels being defined as above. Then, a sequence 
was removed from a family if its total indel score was be- 
yond 20 and the difference between its indel score and the 
mean indel score in the family was beyond 3.7 times the 
standard deviation of the indel scores in family, i.e. if the se- 
quence had significantly more and/or longer indels than 
what is representative of the rest of the family. Then, 50 se- 
quences were selected at random, and fiirther random se- 
lection was done to get pairs, triplets etc. up to 15 
sequences in alignments. From these samples of known ref- 
erence alignment, we could produce many different align- 
ment samples using StatAlign [32]. For each RNA 
alignment, 200 alignment samples were taken, and the ref- 
erence alignments were also kept to for comparison. We 
refer to this dataset throughout as the StatAlign dataset. 

Random alignment data 

We also wanted to measure the effect of alignment accur- 
acy on secondary structure independently of alignment 
method. Therefore we created a dataset based on the RNA 
families of the StatAlign dataset, where alignments were 
sampled uniformly at various fixed distances from the ref- 
erence alignment. Using the alignment distance measure in 
Eq. 2, we created a Metropolis-coupled MCMC framework 
that runs several parallel MCMC chains to take alignment 
samples from the target distribution 

/ 2\d{a,a')-d\\ 
7T{a) = exp i^- ' ^ ^ ' j (9) 

where a^ is the reference alignment, d is the target distance 
to get samples from and t is the temperature of the chain. 
To improve the mixing properties of the chains we allowed 
each chain to explore alignments that do not exactly match 
the specified target distance (with an exponentially de- 
creasing probability, as described by Eq. 9) but then 
rejected non-exact matches when taking samples from the 
cold chain (^ = 1). 

The state space of the Markov chains is the set of all 
possible multiple alignments of the input sequences. 
Alignments that represent the same set of homology 
statements, and only differ by the order of the alignment 

A- 

columns, are treated as different (e.g. alignments 

—B 

-A 

and j5 of the sequences A and B). The following basic 
B— 

alignment rearrangement moves are iterated: 

1. breaking an alignment column into two columns by 
moving one of its characters into a new column, 
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placing gaps in every other row: 



c 


A 


G 


c 


A 


G 


c 


A 
1. 


G 


c 


-A 


G 


c 


A- 


G 


c 


A- 


G 



2. the exact reverse of the previous, i.e. joining two 
compatible adjacent columns - one having gaps in 
all but one row, the other having a gap in that row 
- to form a single column (see above) 

3. relocating one character of a column to a gap 
position of an adjoining column: 



c 


AG 


T 


c 


AG 


T 


c 


A- 

3. 


T 


c 


AG 


T 


c 


AG 


T 


c 


-A 


T 



As the space of alignments is vast and the moves are 
very local, to get a good approximation of uniform sam- 
pling, the number of steps that have to be done is on the 
order of millions, even for small inputs when a single 
chain is run. To this end, we created a special alignment 
representation where the above rearrangements can be 
carried out very efficiently (essentially constant time, i.e. 
in time proportional to column size but independent of 
the alignment length, even when the cost of randomly 
selecting the column to alter is included). For moderate 
input sizes (5-10 sequences of length 300-500) this rep- 
resentation allowed us to take 1.5 million rearrangement 
steps in a second (using a Java 6 implementation on one 
core of a 2.4 GHz Intel 13 processor). 

A single chain is sufficient for small alignments, but 
very slow mixing becomes an issue for practical align- 
ment sizes. We have found that standard parallel tem- 
pering techniques effectively speed up mixing if the 
chain temperatures are chosen correctly. Because the 
optimal temperatures vary significantly depending on 
reference alignment characteristics and target distribu- 
tion, a simple acceptance optimization routine was 
added that tunes chain temperatures to achieve chain 
swap acceptance ratios between neighboring chains 
around a pre-set value. We found ratios 0.7-0.8 to be 
the most effective. With this framework, running 8-10 
parallel chains was best to maximize the speed of con- 
vergence to the uniform distribution as compared to a 
single chain with sampling times 8-10-fold. 



The above described framework was utilized to create a 
dataset consisting of the RNA families of StatAlign 
dataset, where for each family and each selection of 5 
representative sequences from the family, 10 samples were 
taken at a distance corresponding to a similarity of 0.98 to 
the reference alignment (see Eq. 3), then 10 samples at a 
similarity of 0.96 etc., down to a similarity ratio of 0.6. We 
refer to this dataset as the Random Alignment dataset 

Extending information entropy to alignment space 

The information entropy defined for a single alignment 
contains the length of the alignment as a parameter. 
Attempting to extend the measure to the probability mass 
over RNA secondary structure space, variable alignment 
length is a concern. For example, if we have two align- 
ments and corresponding secondary structures 

(((( )))) ((((....)))) 

CCCCAAAA - GGGG CCCCAAAA GGGG 

CCCCAAA-AGGGG CCCCAAAAGGGG 

we would not want to suggest that these alignments give 
two different secondary structures. Consequently, we use 
a projection method to give alignment column pairing 
probability matrices the same dimension, so that the 
matrices can be averaged. 

For a given set of input sequences, the sequence 
containing the greatest number of non-gap characters was 
chosen as the reference sequence. Each pairing probability 
matrix is projected by deleting columns and rows of the 
matrix corresponding to gap positions in the reference se- 
quence, thus ensuring each matrix corresponding to an 
alignment sample has the same dimensions (a square 
matrix, with dimensions equal to the number of non-gap 
characters in the reference sequence). For example, we 
might start with an (w + 1) (« + 1) matrix, delete row / and 
column / due to a gap in position / in the reference se- 
quence, then end up with an n x n matrix as required. 

To calculate information entropy over alignments, we 
need to be able to calculate the probability of each align- 
ment. However, we cannot calculate the information en- 
tropy explicitly, since the probability of a given secondary 
structure ss is 

p[ss]= (10) 

alignments 

and there is no known efficient way to recurse over all 
possible alignments. Instead, we create an information en- 
tropy measure based on samples from the alignment 
space, and show that, in the sample-size limit, the 
alignment-sample information entropy tends to the true 
information entropy. 

Consider alignment samples a^) . . ., a^i from the space of 
all alignments of m sequences, sampled according to their 
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probability. If we are using statistical alignment, as in 
StatAlign, we will be sampling alignments in this fashion. 
Then we have for a column c, once alignments have been 
projected to the same length, the probability of being un- 
paired 

n 

P[c I c unpaired] = P[c\c unpaired in alignment 

i=l 

(11) 

with an analogous result holding for paired columns. We 
now define a sample phylogenetic probability P5 as the 
average of the sample phylogenetic probabilities: 

n ^ 

^^[c I c unpaired] = 7^ -P[c|c unpaired in alignment <3^/], 
n 

1=1 

(12) 

To show this sample probability converges to the true 
probability as sample size tends to infinity, we first note 
that, rearranging Equation 12: 

Ps [c I c unpaired] 

" 1 / \ 
= ^ P [c I c unpaired in alignment A] ] , 

j=l \A&alignments J 

(13) 

with 1 being the indicator function. Hence 

Pg [c I c unpaired] 

/ " 1 \ 

= P [c I c unpaired in alignment A] I j . 

Aealignments \i=l ^ J 

(14) 

Tal<ing the limit A2 ^ 00, by the weal< law of large numbers 

[c I c unpaired] ^ ^ P[c|c unpaired in alignment A] P [A] asn^^ 

alignments 

(15) 

= P[c\c unpaired] ( 16 ) 

as required. 

Now, we have from above that the entropy (P) of 
grammar derivations O of a grammar G is 

H^{P) = \og^{T)-l-Y,PG[d\PT[d\\ogM<i\) 

d<E0 

-^Y.PG[d]PT[d]log,iPT[d]) 

(17) 

Since tree probabilities are the product of unpaired 
and paired column probabilities in the derivation, the 
tree probabilities can be recalculated from the sample of 
alignments. These can then be substituted into the above 
equation to get an approximation to the information 



entropy over the space of sampled alignments as well. 
We refer to this entropy of more than one alignment as 
the alignment consensus entropy. 

Results and discussion 

Alignment quality and predictive accuracy 

A common question in comparative RNA secondary 
structure prediction is how many sequences are required 
to get a good structure prediction. This is briefly 
addressed in [11], but the sample size is quite small, and 
only total accuracy is considered. With 15 sequences in 
the alignment, we assume that no more evolutionary in- 
formation can be gained by adding further sequences, 
but when fewer sequences are present, lack of informa- 
tion might yield a poorer structure prediction. 

Instead of considering total accuracy, we wanted to 
quantify relatively how much accuracy is lost when fewer 
sequences are present. For example, if an alignment with 
15 sequences is predicted with an average F-score of 0.5, 
and an alignment of the same family with 3 sequences is 
predicted with an average F-score of 0.4, then 80% of the 
accuracy will have been retained, that is the alignment 
with 3 sequences has a relative F-score of 0.8. 

To investigate how many sequences are needed for 
a good structure prediction, we took the StatAlign 
dataset and considered the relative F-score for each 
family. Almost 100% of the possible F-score was 
achieved when the alignment contained 5 sequences, 
for both PPfold and RNAalifold. Interestingly, the ac- 
curacy of RNAalifold decreased slightly as the number 
of sequences was increased. This is due to the 
increased number of non-canonical base-pairs in the 
alignments, which the thermodynamic method could 
not predict. PPfold, on the other hand, has a small 
probability for non-canonical base-pairs, so is not 
affected by these in the same way. Overall these 
results suggest that 5 sequences are sufficient for 
approaching maximal predictive accuracy. 

To consider the effect alignment quality has on RNA 
secondary structure prediction, we took the StatAlign 
and Random Alignment datasets and measured their 
similarity to the reference alignment using the similarity 
score above. Again, percentage of accuracy retained was 
calculated by normalizing against the accuracy achieved 
on the reference alignment. Log-scale heatmaps showing 
the accuracy and percentage of accuracy retained for the 
StatAlign dataset (A) and Random Alignment dataset 
(B) for PPfold and RNAalifold can be seen in Figure 2. 
As expected, decreasing alignment quality decreases the 
accuracy of structure predictions. However, other pat- 
terns also emerge from these graphs. 

Firstly, in the StatAlign dataset (Figure 2A), we ob- 
serve a weaker correlation between alignment distance 
and accuracy than for the Random Alignment dataset 
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(Figure 2B). This suggests that predictions are better for 
the StatAlign dataset on alignments far from the refer- 
ence alignment. StatAlign looks to produce alignments 
with a high likelihood under an evolutionary model, 
which the random alignments do not consider, and so 
StatAlign's alignments could be considered more realis- 
tic. This con rms the expectation that StatAlign's align- 
ments are more useful for RNA secondary structure 
prediction than random alignments. 

Secondly, for the StatAlign dataset, we see a much 
higher correlation with the F-score than with the relative 
F-score. Some families of RNA consistently produce the 
same alignment, which skews the graphs. For example, 
an alignment which consistently has similarity to the ref- 
erence alignment of 0.9, and F-score 0.5 would give a 
relative F-score of 1 every time, and would support the 
correlations seen on each graph. Because we can control 
the spread of distances in the random alignment data 
set, we don't see this behaviour. As expected, variation 
comes more with families that produce more variable 
alignments. In the StatAlign dataset, this is obscured by 
those families which produce consistent alignments. 

Lastly, for the Random Alignment dataset (Figure 2B), 
we see many more zero quality predictions in the case of 
RNAalifold than in the case of PPfold. This is most eas- 
ily seen by the larger intensity of red in the main body 
of the heatmap for PPfold. This suggests that PPfold 
functions better than RNAalifold when given a low- 
quality alignment, likely due to its more complete model 
for molecular evolution. 

Evolutionary distance 

We also consider the effects of evolutionary distance on 
RNA secondary structure prediction quality. One might 
expect that there is a "sweet spot" for evolutionary dis- 
tance- sequences too close to each other do not display 
enough co-variation to benefit the evolutionary model, 
but the evolutionary signal might be lost if the distance 
is too large. To investigate this, we measured evolution- 
ary distance in the phylogenetic trees predicted by 
PPfold using four different measures: 

Measure 1- Mean of all the evolutionary distances, 
Measure 2- Standard deviation of all the evolutionary 
distances, 

Measure 3- Maximum evolutionary distance, 
Measure 4- Maximum difference between evolutionary 
distances. 

All four measures would be expected to be correlated 
with sequence length, which is well known to correlate 
with predictive accuracy. To account for this, we consid- 
ered relative evolutionary distance, similar to the relative 
predictive accuracy above. A measure was normalized by 



the average for that family and number of sequences, so 
that it could be seen whether an alignment had greater 
or less evolutionary distance than might be expected. 
We then looked at the correlation between relative evo- 
lutionary distance and predictive accuracy. All methods 
correlated extremely poorly with predictive accuracy and 
relative predictive accuracy, measures 1 to 4 having 
correlations with relative predictive accuracy of 0.0259, 
0.0373, 0.0303, and 0.0265 respectively (data not shown). 
This suggests that evolutionary distance is not an under- 
lying factor for variation in RNA secondary structure 
prediction, and those other factors, such as those seen in 
[26], play a more important role in determining predict- 
ive accuracy. 

Alignment distances and maximum posterior decoding 

Given the results concerning accuracy lost as alignment 
quality decreases, it would be desirable to be able to 
predict alignment quality, with the hope of predicting 
structure prediction quality. This has previously been 
attempted in [36] . First, the sequences were aligned with 
ClustalW [18]. The sequences were then re-aligned 
using 4 other programs (Align-m [41], MUSCLE [42], 
Prob-Cons [43], and T-Coffee [44]) and the similarity 
between the alignment generated using ClustalW to 
each of the 4 other alignments was measured. The max- 
imum of the 4 similarities, max (g), was chosen as a pre- 
dictor of alignment quality. The authors of [36] detected 
a strong correlation between the true similarity (the 
similarity between the ClustalW alignment and a refer- 
ence alignment) and max (g). 

We implemented a modified version of this method. 
For a given set of input sequences we aligned with both 
AMAP [36] and with StatAlign, obtaining the maximum 
posterior decoding alignment (MPD alignment) from 
StatAlign. The similarity between the AMAP alignment 
and the MPD alignment was used as our predicted simi- 
larity measure. This produced a strong correlation be- 
tween our predicted similarity and true similarity, with 
an adjusted R-squared value of 0.6524. 

We also implemented another method, which cal- 
culates an estimate of the expected similarity score 
using posterior probabilities from the MPD align- 
ment. For each column, we might expect that a pos- 
terior probability close to 1 would contribute a 
score of close to 1 to the similarity measure. So our 
predicted alignment distance is just the average of 
the column posterior probabilities. Figure 3 shows 
an example of the correlation between predicted 
similarity and true similarity, here giving an ad- 
justed R-squared value of 0.8403. Our new predicted 
similarity can be calculated efficiently, and is a 
strong predictor of true similarity to the reference 
alignment. 
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reference alignment for 50 Rfam families with 5 sequences in. Predicted similarity was calculated by averaging the posterior probabilities given in 
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Extending information entropy to alignment space 

To test the information entropy extension developed 
above, we calculated the alignment consensus entropy 
for samples of alignments from the StatAlign dataset. 



representative RNA families from the StatAlign dataset. 
On each graph, the information entropy for each of 
1000 statistical alignment samples is given, as well as the 
alignment consensus entropy. The leftmost figure is one 
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Figure 4 Information entropy with alignment uncertainty. Information entropy calculated for each alignment sample (yellow) and the 
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details on entropy calculation. 
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rightmost figure where the alignment samples were very 
different, and the middle figure closer to the median 
value of alignment sample similarity. For family 3 (where 
sample alignments had low diversity), we see that the 
alignment consensus entropy is comparative to the mean 
entropy of the individual samples. This is expected, as it 
indicates there is little uncertainty in the alignment. On 
the other hand, the high-diversity family has much 
higher alignment consensus entropy than for each indi- 
vidual sample. This is again expected, as the difference 
in entropies indicates a high uncertainty in alignment. In 
this way, we can incorporate alignment uncertainty into 
our understanding of comparative RNA secondary struc- 
ture prediction. 

Predicting secondary structure accuracy 

Given strong correlations between the alignment quality 
and the structure prediction quality, we might expect 
that we could find a predictor of structure prediction 
accuracy. By "integrating out" alignment uncertainty, we 
may find a reliability score which is more reliable than 
the one currently reported by the PPfold. To test this, 
we predicted accuracy for one of the five-sequence 
alignments of each family and then tested the predicted 
accuracy against the true accuracy. The PPfold reliability 
score produced an adjusted score of 0.252 when 
considering correlation with the true F-scores. 



For our new reliability score, we adjusted the PPfold 
reliability score to consider only base-pairs, as the F- 
score considers only base-pairs (i.e. ignored unpaired 
nucleotide probabilities). We then performed linear re- 
gression with the average of the MPD column probabil- 
ities, the information entropy of the alignment space, 
and this pairs -only reliability score against the known F- 
score measure. This multiple regression improved the 
reliability score significantly. Figure 5 shows the pre- 
dicted F-score against the true F-score, for a randomly 
chosen five-sequence alignments from each family of the 
Stat Align dataset. The adjusted value with the new 
reliability measure improved to 0.496. These results 
seem to indicate that while alignment quality does affect 
structure prediction quality, the actual structure predic- 
tion model still plays a great role in the overall predic- 
tion accuracy. Consequently, improving these models, 
possibly by incorporating other kinds of information 
(such as experimental probing data), is an area where 
new research e orts are still needed in RNA secondary 
structure prediction. 

Conclusions 

In this paper we have explored a number of factors 
which can contribute to poor RNA secondary structure 
prediction quality. We established a relationship between 
alignment quality and expected loss of accuracy. 
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MPD alignment. The adjusted R-squared value a linear fit is 0.496. 
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Furthermore, we provided a method to predict align- 
ment quality based only on statistical alignment samples. 
\X1iile our predictor of accuracy improves on the PPfold 
reliability score, there is still a large amount of variability 
unaccounted for, which we therefore suggest comes from 
the predictive model itself. To consider this further, we 
extended the information entropy measure for SCFGs to 
consider uncertainty in alignments. 

The fact that our accuracy predictor did not account 
for all the variances associated with RNA secondary 
structure prediction, despite good predictors being 
found for alignment quality and a strong correlation be- 
tween alignment quality and predictive accuracy, sug- 
gests that whilst alignment quality is an important 
factor, the predictive model itself determines plays a 
large part in the quality of prediction. Given what is 
shown in Figure 1 for single sequence predictions, that 
the accuracy of PPfold and RNAfold is very variable, it is 
unsurprising that variances remain. Clearly then, further 
efforts should be put into creating stronger single- 
sequence models, and then the advantages of evolution- 
ary modelling and additional structural constraints will 
benefit further. The use of experimental data from new 
probing experiments as well as more biologically realistic 
constraints, such as kinetic or co-transcriptional folding, 
may improve the results of RNA secondary structure 
prediction. 
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Additional file 1: Information on the 50 RNA families randomly 
selected from Rfam. 
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